Abstract

We present a new lossy compressor for discrete-valued sources. For coding a sequence $x^n$, the encoder starts
by assigning a certain cost to each possible reconstruction sequence. It then finds the one that minimizes this cost
and describes it losslessly to the decoder via a universal lossless compressor. The cost of each sequence is a linear
combination of its distance from the sequence $x^n$ and a linear function of its $k$-th order empirical distribution. The
structure of the cost function allows the encoder to employ the Viterbi algorithm to recover the minimizer of the
cost. We identify a choice of the coefficients comprising the linear function of the empirical distribution used in
the cost function which ensures that the algorithm universally achieves the optimum rate-distortion performance
of any stationary ergodic source in the limit of large $n$, provided that $k$ diverges as $o(\log n)$. Iterative techniques
for approximating the coefficients, which alleviate the computational burden of finding the optimal coefficients, are
proposed and studied.



I. Introduction 

Consider the problem of universal lossy compression of stationary ergodic sources described as follows. Let
$\mathbf{X} = \{X_i;\ \forall\, i \in \mathbb{N}^+\}$ be a stochastic process and let $\mathcal{X}$ denote its alphabet, which is assumed discrete and finite
throughout this paper. Consider a family of source codes $\{C_n\}_{n \geq 1}$. Each code $C_n$ in this family consists of an
encoder $f_n$ and a decoder $g_n$ such that

$$f_n : \mathcal{X}^n \to \{0,1\}^*, \tag{1}$$

and

$$g_n : \{0,1\}^* \to \hat{\mathcal{X}}^n, \tag{2}$$

where $\hat{\mathcal{X}}$ denotes the reconstruction alphabet, which is also assumed to be finite and in most cases is equal to $\mathcal{X}$.
$\{0,1\}^*$ denotes the set of all finite-length binary sequences. The encoder $f_n$ maps each source block $X^n$ to a binary
sequence of finite length, and the decoder $g_n$ maps the coded bits back to the signal space as $\hat{X}^n = g_n(f_n(X^n))$. Let
$l_n(f_n(X^n))$ denote the length of the binary sequence assigned to the sequence $X^n$ by the encoder $f_n$. The performance
of each code in this family is measured by the expected rate and the expected average distortion it induces. For a
given source $\mathbf{X}$ and coding scheme $C_n$, the expected rate $R_n$ and the expected average distortion $D_n$ of $C_n$ in coding
the process $\mathbf{X}$ are defined as follows:

$$R_n = E\left[\frac{1}{n}\, l_n(f_n(X^n))\right], \tag{3}$$

and

$$D_n = E[d_n(X^n, \hat{X}^n)] \triangleq E\left[\frac{1}{n} \sum_{i=1}^{n} d(X_i, \hat{X}_i)\right], \tag{4}$$

where $\hat{X}^n = g_n(f_n(X^n))$, and $d : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$ is a per-letter distortion measure.

For a given process $\mathbf{X}$ and any rate $R > 0$, the minimum achievable distortion (cf. [1] for the exact definition of
achievability) is characterized as [2], [4]

$$D(R, \mathbf{X}) = \lim_{n \to \infty}\ \min_{p(\hat{X}^n | X^n)\,:\, I(X^n; \hat{X}^n) \leq R} E[d_n(X^n, \hat{X}^n)]. \tag{5}$$

Similarly, for any distortion $D > 0$, define $R(D, \mathbf{X})$ to denote the minimum required rate for achieving distortion
$D$, i.e.,

$$R(D, \mathbf{X}) = \min_{D(r, \mathbf{X}) \leq D} r.$$

Universal lossy compression codes are usually defined in the literature in one of the following modes [5]:
I. Fixed-rate: A family of lossy compression codes $\{C_n\}$ is called fixed-rate universal if, for every stationary
ergodic process $\mathbf{X}$, $R_n \leq R$ for all $n \geq 1$, and

$$\limsup_{n} D_n = D(R, \mathbf{X}).$$

II. Fixed-distortion: A family of lossy compression codes $\{C_n\}$ is called fixed-distortion universal if, for every
stationary ergodic process $\mathbf{X}$, $D_n \leq D$ for all $n \geq 1$, and

$$\limsup_{n} R_n = R(D, \mathbf{X}).$$

III. Fixed-slope: A family of lossy compression codes $\{C_n\}$ is called fixed-slope universal if there exists $\alpha > 0$
such that, for every stationary ergodic process $\mathbf{X}$,

$$\limsup_{n}\, [R_n + \alpha D_n] = \min_{D \geq 0}\, [R(D, \mathbf{X}) + \alpha D].$$

The existence of universal lossy compression codes for all these paradigms was established in the
literature long ago [6], [7], [8], [9], [10], [11]. The remaining challenging step is to design universal lossy
compression algorithms that are implementable and appealing from a practical viewpoint.

A. Related prior work 

Unlike lossless compression, where there exist a number of well-known universal algorithms which are also
attractive from a practical perspective (cf. the Lempel-Ziv algorithm [12] or the arithmetic coding algorithm [13]), in lossy
compression, despite all the progress in recent years, no such algorithm is yet known. In this section, we briefly
review some of the related literature on universal lossy compression with the main emphasis on the progress towards 
the design of practically appealing algorithms. 

There have been different approaches towards designing universal lossy compression algorithms. Among them,
the one with the longest history is that of tuning the well-known universal lossless compression algorithms to work
for the lossy case as well. For instance, Cheung and Wei [14] extended the move-to-front transform to the case
where the reconstruction is not required to perfectly match the original sequence. One basic tool used in LZ-type
compression algorithms is the idea of string matching, and hence there have been many attempts to find optimal
approximate string matching. Morita and Kobayashi [15] proposed a lossy version of the LZW algorithm, and Steinberg
and Gutman [16] suggested a fixed-database lossy compression algorithm based on string matching. Although the
extensions could all be implemented efficiently, they were later proved to be sub-optimal by Yang and Kieffer [17],
even for memoryless sources. Another related example is the work by Luczak and Szpankowski, which proposes
another suboptimal compression algorithm that again uses the ideas of approximate pattern matching [18]. For
some other related work see [19], [20], [21].

Another well-studied approach to lossy compression is Trellis coded quantization [22] and, more generally, vector
quantization (cf. [23], [24] and the references therein). Codes of this type are usually designed for a given
distribution encountered in a specific application. For example, such codes are used in image compression (JPEG)
or video compression (MPEG). Nevertheless, there have been attempts at extending such codes to more general
settings. For instance, Kasner, Marcellin, and Hunt proposed universal Trellis coded quantization, which is used in
the JPEG2000 standard [25].

There has been a lot of progress in recent years in designing non-universal lossy compression algorithms for
discrete memoryless sources. Some examples of the recent work in this area are as follows. Wainwright and
Maneva [26] proposed a lossy compression algorithm based on message-passing ideas. The effectiveness of the
scheme was shown by simulations. Gupta and Verdu proposed an algorithm based on non-linear sparse-graph codes
[27]. Another algorithm with near-linear complexity is suggested by Gupta, Verdu and Weissman in [28]. The
algorithm is based on a 'divide and conquer' strategy. It breaks the source sequence into sub-blocks and codes the
subsequences separately using a random codebook. Finally, the capacity-achieving polar codes proposed by Arikan
[29] for channel coding are shown to be optimal for lossy compression of binary-symmetric memoryless sources
in [30].

The idea of fixed-slope universal lossy compression was first suggested by Yang, Zhang and Berger in [5]. They
proposed a generic fixed-slope universal algorithm which leads to specific coding algorithms based on different 
universal lossless compression algorithms. Although the constructed algorithms are all universal, they involve 
computationally demanding minimizations, and hence are impractical. In 0, the authors considered lowering the 
search complexity by choosing appropriate lossless codes which allow to replace the required exhaustive search by 
a low-complexity sequential search scheme that approximates the solution of the required minimization. However, 
these schemes only find an approximation of the optimal solution. 

In a recent work [31], a new implementable algorithm for fixed-slope lossy compression of discrete sources was
proposed. Although the algorithm involves a minimization which resembles a specific realization of the generic cost
proposed in [5], it is somewhat different. The reason is that the cost used in [31] cannot be derived directly from
a lossless compression algorithm. The advantage of the new cost function is that it lends itself rather naturally to
Gibbs simulated annealing, in that the computational effort involved in each iteration is modest. It was shown that
using a universal lossless compressor to describe the reconstruction sequence found by the annealing process to the
decoder results in a scheme which is universal in the limit of many iterations and large block length. The drawback
of the proposed scheme is that, although its computational complexity per iteration is independent of the block
length $n$ and linear in a parameter $k_n = o(\log n)$, there is no useful bound on the number of iterations required for
convergence.

In this paper, motivated by the algorithm proposed in [31], we propose another approach to fixed-slope lossy
compression of discrete sources. We start by making a linear approximation of the cost used in [31]. The cost
assigned to each possible reconstruction sequence consists of a linear combination of two terms: a linear function
of its empirical distribution plus its distance to (distortion from) the source sequence. We show that there exist
coefficients such that minimizing the linearized cost function results in the same performance as would
minimizing the original cost. The advantage of the modified cost is that its minimizer can be found simply using
the Viterbi algorithm.

B. Organization of this paper 

The organization of the paper is as follows. In Section II, the count matrix of a sequence and its empirical
conditional entropy are introduced and some of their properties are studied. Section III reviews the fixed-slope
universal lossy compression algorithm used in [31]. Section IV describes a new coding scheme for fixed-slope
lossy compression derived by replacing part of the cost used in the mentioned exhaustive-search algorithm by a
linear function. We prove that, using appropriate coefficients for the linear function, the performance of the two
algorithms remains the same. In Section V, a method for approximating these optimal coefficients is presented.
This method, along with the result of the previous section, gives rise to a fixed-slope universal lossy compression
algorithm that achieves the rate-distortion performance for any discrete stationary ergodic source. The advantage of
this modified cost is discussed in Section VI, where we show that the minimizer of the new cost can be found using
the Viterbi algorithm. The method introduced for approximating the coefficients is computationally demanding, and
hence is impractical. Therefore, in Section VII we discuss a low-complexity iterative detour for approximating
the coefficients. Section VIII presents some simulation results and, finally, Section IX concludes the paper with a
discussion of some future directions.

II. Conditional empirical entropy and its properties 

For any $y^n \in \mathcal{Y}^n$, let the $|\mathcal{Y}| \times |\mathcal{Y}|^k$ matrix $\mathbf{m}(y^n)$ denote its $(k+1)$-th order empirical distribution.
(For any set $\mathcal{A}$, $|\mathcal{A}|$ denotes its size.) For
$\mathbf{b} = (b_1, \ldots, b_k) \in \mathcal{Y}^k$ and $\beta \in \mathcal{Y}$, the element in the $\beta$-th row and the $\mathbf{b}$-th column of the matrix $\mathbf{m}$, $m_{\beta,\mathbf{b}}$, is
defined as

$$m_{\beta,\mathbf{b}}(y^n) \triangleq \frac{1}{n} \left|\left\{1 \leq i \leq n : y_{i-k}^{i-1} = \mathbf{b},\ y_i = \beta\right\}\right|, \tag{6}$$

where here and throughout the paper we assume a cyclic convention whereby $y_i = y_{i+n}$ for $i \leq 0$.

Based on the distribution induced by $\mathbf{m}(y^n)$, define the $k$-th order conditional empirical entropy of $y^n$, $H_k(y^n)$,
as

$$H_k(y^n) \triangleq H(Z_{k+1} | Z^k), \tag{7}$$

where $Z^{k+1}$ is assumed to be distributed according to $\mathbf{m}$, i.e.,

$$\mathrm{P}\left(Z^{k+1} = [b_1, \ldots, b_k, \beta] = [\mathbf{b}, \beta]\right) = m_{\beta,\mathbf{b}}(y^n). \tag{8}$$

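For a vector $\mathbf{v} = (v_1, \ldots, v_\ell)^T$ with non-negative components, we let $\mathcal{H}(\mathbf{v})$ denote the entropy of the random
variable whose probability mass function (pmf) is proportional to $\mathbf{v}$. Formally,

$$\mathcal{H}(\mathbf{v}) = \begin{cases} \sum_{i=1}^{\ell} \frac{v_i}{\|\mathbf{v}\|_1} \log \frac{\|\mathbf{v}\|_1}{v_i} & \text{if } \mathbf{v} \neq (0, \ldots, 0)^T, \\ 0 & \text{if } \mathbf{v} = (0, \ldots, 0)^T, \end{cases} \tag{9}$$

where $0 \log(0) = 0$ by convention. With this notation, the conditional empirical entropy $H_k(y^n)$ defined in (7) is
readily seen to be expressible in terms of $\mathbf{m}(y^n)$ as

$$H_k(y^n) = H(\mathbf{m}(y^n)) \triangleq \sum_{\mathbf{b}} \mathcal{H}(\mathbf{m}_{\cdot,\mathbf{b}}) \sum_{\beta \in \mathcal{Y}} m_{\beta,\mathbf{b}}, \tag{10}$$

where $\mathbf{m}_{\cdot,\mathbf{b}}$ denotes the column of $\mathbf{m}$ indexed by $\mathbf{b}$.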
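
The definitions in (6) and (10) can be illustrated with a short computation. The following is a minimal sketch, not part of the original scheme: the function names `count_matrix` and `conditional_empirical_entropy` are hypothetical, the matrix is stored sparsely as a dictionary, and logarithms are taken in base 2 as an assumption.

```python
from collections import Counter
from math import log2

def count_matrix(y, k):
    """Empirical (k+1)-th order distribution as {(context, symbol): frequency}."""
    n = len(y)
    counts = Counter()
    for i in range(n):
        # Cyclic convention: indices before the start wrap around to the end.
        context = tuple(y[(i - j) % n] for j in range(k, 0, -1))
        counts[(context, y[i])] += 1
    return {key: c / n for key, c in counts.items()}

def conditional_empirical_entropy(y, k):
    """H_k(y^n) in bits (base 2 is an assumption), computed as in (10)."""
    m = count_matrix(y, k)
    column_mass = Counter()
    for (context, _), p in m.items():
        column_mass[context] += p
    # Each column contributes (column mass) * entropy of the normalized column.
    return sum(p * log2(column_mass[context] / p) for (context, _), p in m.items())
```

For the alternating binary sequence 0, 1, 0, 1, 0, 1 this gives one bit at order $k = 0$ and zero bits at order $k = 1$, since every symbol is determined by its predecessor.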

Remark 1: Note that $H_k(\cdot)$ has a discrete domain, while the domain of $H(\cdot)$ is continuous and consists of all
$|\mathcal{Y}| \times |\mathcal{Y}|^k$ matrices with non-negative real entries adding up to one. In other words,

$$H_k : \mathcal{Y}^n \to [0,1], \tag{11}$$

but

$$H : [0,1]^{|\mathcal{Y}|} \times [0,1]^{|\mathcal{Y}|^k} \to [0,1]. \tag{12}$$

The conditional empirical entropy of sequences, $H_k(\cdot)$, plays a key role in our results. Hence, in the following two
subsections, we focus on this function and study some of its properties.

A. Concavity 

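We prove that, like the standard entropy function, conditional empirical entropy is also a concave function. By
definition

$$H(\mathbf{m}) = \sum_{\mathbf{b} \in \mathcal{Y}^k} \Big(\sum_{\beta \in \mathcal{Y}} m_{\beta,\mathbf{b}}\Big) \mathcal{H}(\mathbf{m}_{\cdot,\mathbf{b}}), \tag{13}$$

where $\mathcal{H}(\cdot)$ is defined in (9). We need to show that for any $\theta \in [0,1]$, and matrices $\mathbf{m}^{(1)}$ and $\mathbf{m}^{(2)}$ with non-negative
components adding up to one,

$$\theta H(\mathbf{m}^{(1)}) + \bar{\theta} H(\mathbf{m}^{(2)}) \leq H(\theta \mathbf{m}^{(1)} + \bar{\theta} \mathbf{m}^{(2)}), \tag{14}$$

where $\bar{\theta} = 1 - \theta$. From the concavity of the entropy function $\mathcal{H}$, it follows that

$$\begin{aligned}
&\theta \Big(\sum_{\beta \in \mathcal{Y}} m^{(1)}_{\beta,\mathbf{b}}\Big) \mathcal{H}(\mathbf{m}^{(1)}_{\cdot,\mathbf{b}}) + \bar{\theta} \Big(\sum_{\beta \in \mathcal{Y}} m^{(2)}_{\beta,\mathbf{b}}\Big) \mathcal{H}(\mathbf{m}^{(2)}_{\cdot,\mathbf{b}}) \\
&\quad = \Big(\theta \sum_{\beta \in \mathcal{Y}} m^{(1)}_{\beta,\mathbf{b}} + \bar{\theta} \sum_{\beta \in \mathcal{Y}} m^{(2)}_{\beta,\mathbf{b}}\Big) \sum_{i \in \{1,2\}} \frac{\theta_i \sum_{\beta \in \mathcal{Y}} m^{(i)}_{\beta,\mathbf{b}}}{\theta \sum_{\beta \in \mathcal{Y}} m^{(1)}_{\beta,\mathbf{b}} + \bar{\theta} \sum_{\beta \in \mathcal{Y}} m^{(2)}_{\beta,\mathbf{b}}}\, \mathcal{H}(\mathbf{m}^{(i)}_{\cdot,\mathbf{b}}) \\
&\quad \leq \Big(\theta \sum_{\beta \in \mathcal{Y}} m^{(1)}_{\beta,\mathbf{b}} + \bar{\theta} \sum_{\beta \in \mathcal{Y}} m^{(2)}_{\beta,\mathbf{b}}\Big)\, \mathcal{H}\big(\theta \mathbf{m}^{(1)}_{\cdot,\mathbf{b}} + \bar{\theta} \mathbf{m}^{(2)}_{\cdot,\mathbf{b}}\big),
\end{aligned} \tag{15}$$

where $\theta_1 = 1 - \theta_2 = \theta$. Summing up both sides of (15) over all $\mathbf{b} \in \mathcal{Y}^k$ yields the desired result.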
B. Stationarity condition 

Let $p(y^{k+1})$ be a given pmf defined on $\mathcal{Y}^{k+1}$. Under what condition(s) does there exist a stationary process
with its $(k+1)$-th order distribution equal to $p$?

Lemma 1: The necessary and sufficient condition for $\{p(y^{k+1})\}_{y^{k+1} \in \mathcal{Y}^{k+1}}$ to represent the $(k+1)$-th order
marginal distribution of a stationary process is

$$\sum_{\beta \in \mathcal{Y}} p(\beta, y^k) = \sum_{\beta \in \mathcal{Y}} p(y^k, \beta), \quad \forall\, y^k \in \mathcal{Y}^k. \tag{16}$$

Proof:

i. Necessity: The necessity of (16) is a direct result of the stationarity of the process. If $p(y^{k+1})$ is to represent
the $(k+1)$-th order marginal distribution of a stationary process $\mathbf{Y} = \{Y_i\}$, then it should be consistent with
the $k$-th order marginal distribution. Hence, (16) should hold.

ii. Sufficiency: In order to prove the sufficiency, we assume that (16) holds, and build a stationary process with
$(k+1)$-th order marginal distribution equal to $p(y^{k+1})$. Let $\mathbf{Y} = \{Y_i\}_i$ be a Markov chain of order $k$ whose
transition probabilities are defined as

$$\mathrm{P}(Y_{k+1} = y_{k+1} | Y^k = y^k) \triangleq q(y_{k+1} | y^k) \triangleq \frac{p(y^{k+1})}{p(y^k)}, \tag{17}$$

where

$$p(y^k) \triangleq \sum_{\beta \in \mathcal{Y}} p(\beta y^k) = \sum_{\beta \in \mathcal{Y}} p(y^k \beta).$$

Now, given (16), it is easy to check that $p(y^{k+1})$ is the $(k+1)$-th order stationary distribution of the defined
Markov chain. Therefore, $\mathbf{Y}$ is a stationary process with the desired marginal distribution.

■

Throughout the paper, we refer to the condition stated in (16) as the stationarity condition.
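Corollary 1: For any $|\mathcal{Y}| \times |\mathcal{Y}|^k$ matrix $\mathbf{m}$ corresponding to the $(k+1)$-th order empirical distribution of some
$y^n \in \mathcal{Y}^n$, there exists a stationary process whose marginal distribution coincides with $\mathbf{m}$.
Proof: From Lemma 1, we only need to show that (16) holds, i.e.,

$$\sum_{\beta \in \mathcal{Y}} m_{\beta,\mathbf{b}} = \sum_{\beta \in \mathcal{Y}} m_{b_k,[\beta, b_1, \ldots, b_{k-1}]}, \quad \forall\, \mathbf{b} \in \mathcal{Y}^k, \tag{18}$$

which obviously holds because both sides of (18) are equal to $|\{i : y_{i-k}^{i-1} = \mathbf{b}\}|/(n-k)$. ■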
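
As a sanity check on Corollary 1, the stationarity condition (16) can be verified numerically for the cyclic empirical counts of an arbitrary sequence. The sketch below is only illustrative: the helper names `cyclic_counts` and `satisfies_stationarity` are hypothetical, and raw counts are compared instead of normalized frequencies, which does not affect the equality.

```python
from collections import Counter

def cyclic_counts(y, k):
    """Counts of all cyclic windows (y_{i-k}, ..., y_i) of length k+1."""
    n = len(y)
    c = Counter()
    for i in range(n):
        window = tuple(y[(i - j) % n] for j in range(k, -1, -1))
        c[window] += 1
    return c

def satisfies_stationarity(y, k):
    """Check (16): summing out the first or the last symbol gives the same k-th order marginal."""
    c = cyclic_counts(y, k)
    drop_first, drop_last = Counter(), Counter()
    for window, cnt in c.items():
        drop_first[window[1:]] += cnt   # left-hand side of (16)
        drop_last[window[:-1]] += cnt   # right-hand side of (16)
    return drop_first == drop_last
```

Under the cyclic convention both marginals count the same cyclic windows of length $k$, so the check succeeds for every input sequence.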
III. Exhaustive search algorithm
Consider the following lossy source coding algorithm. Given $\alpha > 0$, for encoding a sequence $x^n \in \mathcal{X}^n$, find

$$\hat{x}^n = \arg\min_{y^n \in \hat{\mathcal{X}}^n}\, [H_k(y^n) + \alpha d_n(x^n, y^n)], \tag{19}$$

and describe x n using the Lempel-Ziv coding algorithm. As proved before |5), E), the described algorithm is a 
universal lossy compression algorithm. That is, for any stationary ergodic source X, 

$$\frac{1}{n}\, \ell_{\mathrm{LZ}}(\hat{X}^n) + \alpha d_n(X^n, \hat{X}^n) \xrightarrow{n \to \infty} \min_{D \geq 0}\, [R(D, \mathbf{X}) + \alpha D], \quad \text{a.s.}, \tag{20}$$

where $X^n$ is generated by the source $\mathbf{X}$, and $\hat{X}^n$ denotes the minimizer of (19) for the input $X^n$. Here $\ell_{\mathrm{LZ}}$ denotes
the length of the codeword assigned to $\hat{X}^n$ by the Lempel-Ziv algorithm [12]. Clearly, given the size of the search
space, this is not an implementable algorithm. An approach for approximating the solution of (19) using Markov
chain Monte Carlo (MCMC) methods has been suggested in [31]. One problem with the MCMC-based algorithms is that no
useful bound is yet known on the required number of iterations. Moreover, the performance of the algorithm depends
on the cooling process chosen. There exist cooling schedules with guaranteed convergence, but they are very slow,
and usually not used in practice. On the other hand, if we use faster cooling processes, there is a risk of getting
stuck in a local minimum and missing the optimum solution. The goal of this paper is to propose a new approach for
approximating the solution of (19). This new approach, as we show later, suggests a new implementable algorithm
for lossy compression. The main idea here is to use a linear approximation of the conditional entropy function, $H(\mathbf{m})$,
at some point $\mathbf{m}_0$, and to prove that if $\mathbf{m}_0$ is chosen correctly, then, while we have reduced the exhaustive search
algorithm to the Viterbi algorithm, we have not changed its performance.
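
To make the size of the search in (19) concrete, the minimization can be carried out by brute force for very short sequences. The following is a minimal sketch under several assumptions that are not taken from the paper: a binary reconstruction alphabet by default, Hamming distortion, base-2 logarithms, and hypothetical function names.

```python
from collections import Counter
from itertools import product
from math import log2

def cond_entropy(y, k):
    """k-th order conditional empirical entropy H_k(y^n) with the cyclic convention."""
    n = len(y)
    joint, ctx = Counter(), Counter()
    for i in range(n):
        b = tuple(y[(i - j) % n] for j in range(k, 0, -1))
        joint[(b, y[i])] += 1
        ctx[b] += 1
    return sum(c / n * log2(ctx[b] / c) for (b, _), c in joint.items())

def hamming(x, y):
    """Average per-letter Hamming distortion d_n (an assumed distortion measure)."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def exhaustive_search(x, k, alpha, alphabet=(0, 1)):
    """Brute-force minimizer of H_k(y^n) + alpha * d_n(x^n, y^n) over all |alphabet|^n sequences."""
    best = min(product(alphabet, repeat=len(x)),
               key=lambda y: cond_entropy(y, k) + alpha * hamming(x, y))
    return list(best)
```

The loop visits every one of the $|\hat{\mathcal{X}}|^n$ candidate reconstructions, which is why this sketch is only usable for toy block lengths.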

IV. Linearized cost function 

Consider the problems (P1) and (P2) described by (21) and (22), respectively, where (P1) corresponds to the
optimization required by the exhaustive search lossy compression scheme described in (19), and (P2) involves a
similar optimization problem. The difference between (P1) and (P2) is that the term corresponding to conditional
empirical entropy in (P1), which is a highly non-linear function of $\mathbf{m}$, is replaced by a linear function of $\mathbf{m}$.

$$\text{(P1)}: \quad \min_{y^n}\, \left[H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)\right], \tag{21}$$

and

$$\text{(P2)}: \quad \min_{y^n}\, \Big[\sum_{\beta} \sum_{\mathbf{b}} \lambda_{\beta,\mathbf{b}}\, m_{\beta,\mathbf{b}}(y^n) + \alpha d_n(x^n, y^n)\Big], \tag{22}$$

where $\{\lambda_{\beta,\mathbf{b}}\}_{\beta,\mathbf{b}}$ are a set of real-valued coefficients. In this section we are interested in answering the following
question:

Is it possible to choose the set of coefficients $\{\lambda_{\beta,\mathbf{b}}\}_{\beta,\mathbf{b}}$, $\beta \in \hat{\mathcal{X}}$ and $\mathbf{b} \in \hat{\mathcal{X}}^k$, such that (P1) and (P2) have the
same set of minimizers, or at least the set of minimizers of (P2) is a subset of the minimizers of (P1)?
The reason we are interested in answering this question is that if the answer is affirmative, then instead of solving
(P1) one can solve (P2), which, as we describe in Section VI, can be done efficiently via the Viterbi algorithm.
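Let $\mathcal{S}_1$ and $\mathcal{S}_2$ denote the sets of minimizers of (P1) and (P2), respectively. Consider some $z^n \in \mathcal{S}_1$, let
$\mathbf{m}^* = \mathbf{m}(z^n)$, and let the coefficients used in (P2) be

$$\lambda_{\beta,\mathbf{b}} = \frac{\partial}{\partial m_{\beta,\mathbf{b}}} H(\mathbf{m}) \Big|_{\mathbf{m}^*} = \log \frac{\sum_{\beta'} m^*_{\beta',\mathbf{b}}}{m^*_{\beta,\mathbf{b}}}. \tag{23}$$

Theorem 1: If the coefficients used in (P2) are chosen according to (23), then the minimum values of (P1) and
(P2) will be the same. Moreover,

$$\mathcal{S}_2 \subseteq \mathcal{S}_1,$$

and $\mathcal{S}_2$ contains all the sequences $w^n \in \mathcal{S}_1$ with $\mathbf{m}(w^n) = \mathbf{m}^*$.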

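Proof: Since, as proved earlier, $H(\mathbf{m})$ is concave in $\mathbf{m}$, for any empirical count matrix $\mathbf{m}$, we have

$$\begin{aligned}
H(\mathbf{m}) &\leq H(\mathbf{m}^*) + \sum_{\beta,\mathbf{b}} \frac{\partial}{\partial m_{\beta,\mathbf{b}}} H(\mathbf{m}) \Big|_{\mathbf{m}^*} (m_{\beta,\mathbf{b}} - m^*_{\beta,\mathbf{b}}) && (24) \\
&= H(\mathbf{m}^*) + \sum_{\beta,\mathbf{b}} \lambda_{\beta,\mathbf{b}} (m_{\beta,\mathbf{b}} - m^*_{\beta,\mathbf{b}}) && (25) \\
&= \hat{H}(\mathbf{m}), && (26)
\end{aligned}$$

where $\hat{H}(\mathbf{m}) \triangleq \sum_{\beta,\mathbf{b}} \lambda_{\beta,\mathbf{b}}\, m_{\beta,\mathbf{b}}$ denotes the linear cost appearing in (P2). The last step uses
$\sum_{\beta,\mathbf{b}} \lambda_{\beta,\mathbf{b}}\, m^*_{\beta,\mathbf{b}} = H(\mathbf{m}^*)$, which follows directly from (23).

Adding $\alpha d_n(x^n, y^n)$ to both sides of (26), we conclude that for any $y^n \in \hat{\mathcal{X}}^n$,

$$H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n) \leq \hat{H}(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n). \tag{27}$$

Taking the minimum of both sides of (27) yields

$$\begin{aligned}
\min_{y^n}\, [H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)] &\leq \min_{y^n}\, [\hat{H}(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)] && (28) \\
&\leq \hat{H}(\mathbf{m}(z^n)) + \alpha d_n(x^n, z^n) && (29) \\
&= H(\mathbf{m}(z^n)) + \alpha d_n(x^n, z^n) && (30) \\
&= \min_{y^n}\, [H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)], && (31)
\end{aligned}$$

because $z^n \in \mathcal{S}_1$. Therefore,

$$\min_{y^n}\, [H(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)] = \min_{y^n}\, [\hat{H}(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)], \tag{32}$$

i.e., (P1) and (P2) have the same minimum values.

For any sequence $w^n$ with $\mathbf{m}(w^n) \neq \mathbf{m}^*$, by strict concavity of $H(\mathbf{m})$,

$$\begin{aligned}
\hat{H}(\mathbf{m}(w^n)) + \alpha d_n(x^n, w^n) &> H(\mathbf{m}(w^n)) + \alpha d_n(x^n, w^n) && (33) \\
&\geq \min_{y^n}\, [H_k(y^n) + \alpha d_n(x^n, y^n)]. && (34)
\end{aligned}$$

Hence, the empirical count matrices of all the sequences in $\mathcal{S}_2$, i.e., all the minimizers of (P2) for the selected
coefficients, are equal to $\mathbf{m}^*$.

Let $w^n \in \mathcal{S}_2$. We prove that $w^n \in \mathcal{S}_1$ as well. As we just proved, $\mathbf{m}(w^n) = \mathbf{m}(z^n) = \mathbf{m}^*$. Moreover, since
both $z^n$ and $w^n$ belong to $\mathcal{S}_2$,

$$\min_{y^n}\, [\hat{H}(\mathbf{m}(y^n)) + \alpha d_n(x^n, y^n)] = \hat{H}(\mathbf{m}(w^n)) + \alpha d_n(x^n, w^n) = \hat{H}(\mathbf{m}(z^n)) + \alpha d_n(x^n, z^n). \tag{35}$$

Therefore, $d_n(x^n, w^n) = d_n(x^n, z^n)$, and consequently,

$$H_k(w^n) + \alpha d_n(w^n, x^n) = H_k(z^n) + \alpha d_n(z^n, x^n) = \min_{y^n}\, [H_k(y^n) + \alpha d_n(y^n, x^n)], \tag{36}$$

which proves that $w^n \in \mathcal{S}_1$, and concludes the proof. ■

Theorem 1 states that if the optimal type $\mathbf{m}^*$ is known, then the desired coefficients can be computed according
to (23), and solving (P2) instead of (P1) using the computed coefficients finds a minimizer of (P1). In Section VI
we describe how (P2) can be solved efficiently using the Viterbi algorithm for a given set of coefficients. The problem,
of course, is that the optimal type $\mathbf{m}^*$ required for computing the desired coefficients is not known to the encoder
(since knowledge of $\mathbf{m}^*$ seems to require solving (P1), which is the problem we are trying to avoid). In Section
V we introduce another optimization problem whose solution is a good approximation of $\mathbf{m}^*$, and hence of the
desired coefficients $\{\lambda_{\beta,\mathbf{b}}\}$ when substituted in (23).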
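
The statement of Theorem 1 can be checked numerically on a toy instance by solving both (P1) and (P2) by brute force. The sketch below makes several assumptions that are not in the paper: binary alphabets, Hamming distortion, base-2 logarithms, hypothetical function names, and the convention that a coefficient is $+\infty$ whenever the corresponding entry of $\mathbf{m}^*$ is zero (so that reconstructions using patterns absent from $z^n$ are excluded).

```python
from collections import Counter
from itertools import product
from math import inf, isclose, log2

def counts(y, k):
    """Cyclic counts of (context, symbol) pairs and of contexts."""
    n = len(y)
    joint, ctx = Counter(), Counter()
    for i in range(n):
        b = tuple(y[(i - j) % n] for j in range(k, 0, -1))
        joint[(b, y[i])] += 1
        ctx[b] += 1
    return joint, ctx

def entropy_cost(y, k):
    joint, ctx = counts(y, k)
    return sum(c / len(y) * log2(ctx[b] / c) for (b, _), c in joint.items())

def coefficients(z, k):
    """Coefficients (23) evaluated at m(z^n); unseen pairs are treated as +inf."""
    joint, ctx = counts(z, k)
    return {key: log2(ctx[key[0]] / c) for key, c in joint.items()}

def linear_cost(y, k, lam):
    joint, _ = counts(y, k)
    return sum(c / len(y) * lam.get(key, inf) for key, c in joint.items())

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)

def check_theorem1(x, k, alpha):
    """Return (min of P1, min of P2, whether every P2 minimizer also minimizes P1)."""
    space = list(product((0, 1), repeat=len(x)))
    p1 = {y: entropy_cost(y, k) + alpha * hamming(x, y) for y in space}
    min1 = min(p1.values())
    z = min(space, key=p1.get)              # one minimizer of (P1)
    lam = coefficients(z, k)
    p2 = {y: linear_cost(y, k, lam) + alpha * hamming(x, y) for y in space}
    min2 = min(p2.values())
    s1 = {y for y in space if isclose(p1[y], min1, abs_tol=1e-6)}
    s2 = {y for y in space if isclose(p2[y], min2, abs_tol=1e-9)}
    return min1, min2, s2 <= s1
```

On small inputs the two minima agree up to floating-point error and the minimizers of (P2) form a subset of those of (P1), in line with the theorem.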

V. Computing the coefficients 

As mentioned in the previous section, there exists a set of coefficients for which (P1) and (P2) have the same
value. However, computing the desired coefficients requires the knowledge of $\mathbf{m}^*$, which is not available without
solving (P1). In order to alleviate this issue, in this section we introduce another optimization problem that gives
an asymptotically tight approximation of $\mathbf{m}^*$, and therefore a reasonable approximation of the set of coefficients.

For a given sequence $x^n$ and a given order $k$, let $\mathcal{M}^{(k)} = \mathcal{M}^{(k)}(x^n)$ be the set of all jointly stationary probability
distributions on $(\mathcal{X}^k, \hat{\mathcal{X}}^k)$ (in the sense of Lemma 1) such that their marginal distributions with respect to $\mathcal{X}$ coincide
with the $k$-th order empirical distribution induced by $x^n$, defined as follows:

$$\hat{p}^{(k)}_{[x^n]}(a^k) \triangleq \frac{\left|\left\{1 \leq i \leq n : (x_{i-k}, \ldots, x_{i-1}) = a^k\right\}\right|}{n} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{x_{i-k}^{i-1} = a^k}, \tag{37}$$

where $a^k \in \mathcal{X}^k$. More specifically, a distribution $p^{(k)}$ in $\mathcal{M}^{(k)}$ should satisfy the following two constraints:
1) Stationarity condition: as described in Section II-B, for any $a^{k-1} \in \mathcal{X}^{k-1}$ and $b^{k-1} \in \hat{\mathcal{X}}^{k-1}$,

$$\sum_{a_k \in \mathcal{X},\, b_k \in \hat{\mathcal{X}}} p^{(k)}(a^k, b^k) = \sum_{a_k \in \mathcal{X},\, b_k \in \hat{\mathcal{X}}} p^{(k)}(a_k a^{k-1}, b_k b^{k-1}). \tag{38}$$

2) Consistency: for each $a^k \in \mathcal{X}^k$,

$$\sum_{b^k \in \hat{\mathcal{X}}^k} p^{(k)}(a^k, b^k) = \hat{p}^{(k)}_{[x^n]}(a^k). \tag{39}$$
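For given $x^n$, $k$ and $\ell > k$, consider the following optimization problem:

$$\begin{aligned}
\min\quad & H(\hat{X}_{k+1} | \hat{X}^k) + \alpha\, E\, d(X_1, \hat{X}_1) \\
\text{s.t.}\quad & (X^\ell, \hat{X}^\ell) \sim p^{(\ell)} \in \mathcal{M}^{(\ell)}.
\end{aligned} \tag{40}$$

Remark 2: Note that the rate-distortion function of a stationary ergodic process $\mathbf{X}$ has the following representation:

$$\begin{aligned}
R(D, \mathbf{X}) &= \inf\{\bar{H}(\hat{\mathbf{X}}) : (\mathbf{X}, \hat{\mathbf{X}}) \text{ jointly stationary and ergodic, and } E\, d(X_0, \hat{X}_0) \leq D\} \\
&= \inf_{k \geq 1} \inf\{H(\hat{X}_{k+1} | \hat{X}^k) : (\mathbf{X}, \hat{\mathbf{X}}) \text{ jointly stationary and ergodic, and } E\, d(X_0, \hat{X}_0) \leq D\},
\end{aligned} \tag{41}$$

where $\bar{H}(\hat{\mathbf{X}})$ denotes the entropy rate of the stationary ergodic process $\hat{\mathbf{X}}$, i.e.,

$$\bar{H}(\hat{\mathbf{X}}) \triangleq \lim_{n \to \infty} H(\hat{X}_{n+1} | \hat{X}^n). \tag{42}$$

This representation gives the motivating intuition behind the optimization described in (40). It shows that (40)
is basically performing the search required by (41).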

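Using the properties of the set $\mathcal{M}^{(\ell)}$ and the definition of conditional empirical entropy, (40) can be written
more explicitly as

$$\begin{aligned}
\min\quad & H(\mathbf{m}) + \alpha \sum_{a \in \mathcal{X},\, b \in \hat{\mathcal{X}}} d(a, b)\, q(a, b) \\
\text{s.t.}\quad & 0 \leq p^{(\ell)}(a^\ell, b^\ell) \leq 1, \quad \forall\, a^\ell \in \mathcal{X}^\ell,\ b^\ell \in \hat{\mathcal{X}}^\ell, \\
& \sum_{a_\ell \in \mathcal{X},\, b_\ell \in \hat{\mathcal{X}}} p^{(\ell)}(a^\ell, b^\ell) = \sum_{a_\ell \in \mathcal{X},\, b_\ell \in \hat{\mathcal{X}}} p^{(\ell)}(a_\ell a^{\ell-1}, b_\ell b^{\ell-1}), \quad \forall\, a^{\ell-1} \in \mathcal{X}^{\ell-1},\ b^{\ell-1} \in \hat{\mathcal{X}}^{\ell-1}, \\
& \sum_{b^\ell \in \hat{\mathcal{X}}^\ell} p^{(\ell)}(a^\ell, b^\ell) = \hat{p}^{(\ell)}_{[x^n]}(a^\ell), \quad \forall\, a^\ell \in \mathcal{X}^\ell, \\
& q(a, b) = \sum_{a^{\ell-1} \in \mathcal{X}^{\ell-1},\, b^{\ell-1} \in \hat{\mathcal{X}}^{\ell-1}} p^{(\ell)}(a a^{\ell-1}, b b^{\ell-1}), \quad \forall\, a \in \mathcal{X},\ b \in \hat{\mathcal{X}}, \\
& m_{\beta,\mathbf{b}} = \sum_{a^\ell \in \mathcal{X}^\ell,\, b^{\ell-k-1} \in \hat{\mathcal{X}}^{\ell-k-1}} p^{(\ell)}(a^\ell, \mathbf{b} \beta b^{\ell-k-1}), \quad \forall\, \beta, \mathbf{b},
\end{aligned} \tag{43}$$

where $\hat{p}^{(\ell)}_{[x^n]}$ is the $\ell$-th order empirical distribution induced by $x^n$, as in (37).

Note that the optimization in (43) is done over the joint distributions of $(X^\ell, \hat{X}^\ell)$. Let $\mathcal{V}^*$ denote the set of
minimizers of (43), and $\mathcal{S}^*$ be their $(k+1)$-th order marginalized versions with respect to $\hat{X}$. Let $\{\hat{\lambda}_{\beta,\mathbf{b}}\}_{\beta,\mathbf{b}}$ be the
coefficients evaluated at some $\hat{\mathbf{m}}^* \in \mathcal{S}^*$ using (23). Let $\mathbf{X}$ be a stationary ergodic source, and $R(\mathbf{X}, D)$ denote its
rate-distortion function. Finally, let $\hat{X}^n$ be the reconstruction sequence obtained by solving (P2) (recall (22)) at the
evaluated coefficients.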
Theorem 2: If $k = k_n = o(\log n)$, $\ell = \ell_n = o(n^{1/4})$ and $k = o(\ell)$ such that $k_n, \ell_n \to \infty$ as $n \to \infty$, then for
any stationary ergodic source $\mathbf{X}$,

$$H_k(\hat{X}^n) + \alpha d_n(X^n, \hat{X}^n) \xrightarrow{n \to \infty} \min_{D \geq 0}\, [R(\mathbf{X}, D) + \alpha D], \quad \text{a.s.} \tag{44}$$

The proof of Theorem 2 is presented in Appendix A.

Remark 3: Theorem 2 implies the fixed-slope universality of the scheme which does the lossless compression
of the reconstruction by first describing its count matrix (costing a number of bits which is negligible for large $n$)
and then doing the conditional entropy coding.

Remark 4: Note that all the constraints in (43) are linear, and the cost is a concave function. Hence, overall, we
have a concave minimization problem (of dimension $|\mathcal{X}|^\ell |\hat{\mathcal{X}}|^\ell + |\hat{\mathcal{X}}|^{k+1} + |\mathcal{X}||\hat{\mathcal{X}}|$).
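
To get a feeling for how quickly the dimension quoted in Remark 4 grows, it can simply be evaluated for a few parameter choices. The helper below is a hypothetical illustration and not part of the paper. Its arguments are the two alphabet sizes and the orders $k$ and $\ell$.

```python
def problem_dimension(source_size, recon_size, k, ell):
    """Number of variables in (43): joint pmf entries, count-matrix entries, and q(a, b) entries."""
    joint_pmf = source_size ** ell * recon_size ** ell
    count_matrix = recon_size ** (k + 1)
    single_letter = source_size * recon_size
    return joint_pmf + count_matrix + single_letter

# Binary source and reconstruction alphabets, k = 3.
for ell in (4, 8, 12):
    print(ell, problem_dimension(2, 2, 3, ell))
```

Even for binary alphabets the first term grows as $4^\ell$, so the problem becomes large as soon as $\ell$ is allowed to grow with $n$.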

VI. Viterbi coder

In this section, we show how, for a given set of coefficients $\{\lambda_{\beta,\mathbf{b}}\}$, (P2) can be solved efficiently via the Viterbi
algorithm [33], [34].

Note that the linearized cost used in (P2) can also be written as

$$\sum_{\beta} \sum_{\mathbf{b}} \lambda_{\beta,\mathbf{b}}\, m_{\beta,\mathbf{b}}(y^n) + \alpha d_n(x^n, y^n) = \frac{1}{n} \sum_{i=1}^{n} \left[\lambda_{y_i,\, y_{i-k}^{i-1}} + \alpha d(x_i, y_i)\right]. \tag{45}$$

The advantage of this alternative representation is that, as we will describe, instead of using simulated annealing,
we can find the sequence that exactly minimizes (45) via the Viterbi algorithm, which is a dynamic programming
optimization method for finding the path of minimum weight in a Trellis diagram efficiently. For $i = k+1, \ldots, n$,
let

$$s_i = y_{i-k}^{i} \tag{46}$$

be the state at time $i$, and define $\mathcal{S}$ to be the set of all $|\hat{\mathcal{X}}|^{k+1}$ possible states. From this definition, the state at
time $i$, $s_i$, is determined by the state at time $i-1$, $s_{i-1}$, and $y_i$. In other words, $s_i = g(s_{i-1}, y_i)$ for some

$$g : \mathcal{S} \times \hat{\mathcal{X}} \to \mathcal{S}.$$

This representation leads to a Trellis diagram corresponding to the evolution of the states $\{s_i\}_{i=k+1}^{n}$ in which each
state has $|\hat{\mathcal{X}}|$ states leading to it and $|\hat{\mathcal{X}}|$ states branching from it. To the edge $e = (s', s)$ connecting states $s'$ and
$s = b^{k+1}$ at stage $i$, we assign the weight $w_i(e)$ defined as

$$w_i(e) := \lambda_{b_{k+1},\, b^k} + \alpha d(x_i, b_{k+1}). \tag{47}$$

In this representation, there is a one-to-one correspondence between sequences $y^n \in \hat{\mathcal{X}}^n$ and sequences of states
$\{s_i\}_{i=k+1}^{n}$, and minimizing (45) is equivalent to finding the path of minimum weight in the corresponding Trellis
diagram, i.e., the path $\{s_i\}_{i=k+1}^{n}$ that minimizes $\sum_{i=k+1}^{n} w_i(e_i)$, where $e_i = (s_{i-1}, s_i)$. Solving this minimization
can readily be done by the Viterbi algorithm, which can be described as follows. For each state $s$, let $\mathcal{L}(s)$ be the
$|\hat{\mathcal{X}}|$ states leading to it, and for any $i > 1$, define

$$C_i(s) := \min_{s' \in \mathcal{L}(s)}\, [w_i((s', s)) + C_{i-1}(s')]. \tag{48}$$

For $i = 1$ and $s = b^{k+1}$, let $C_1(s) := \lambda_{b_{k+1},\, b^k} + \alpha d_{k+1}(x^{k+1}, b^{k+1})$. Using this procedure, each state $s$ at each
time $j$ has a path of length $j - k - 1$ which is the minimum path among all the possible paths between the states
from time $i = k+1$ to $i = j$ such that $s_j = s$. After computing $\{C_i(s)\}$ for all $s \in \mathcal{S}$ and all $i \in \{k+1, \ldots, n\}$,
at time $i = n$, let

$$s^* = \arg\min_{s \in \mathcal{S}} C_n(s). \tag{49}$$

It is not hard to see that the path leading to $s^*$ is the path of minimum weight among all possible paths.

Note that the computational complexity of this procedure is linear in $n$ but exponential in $k$, because the number
of states increases exponentially with $k$. Therefore, given the coefficients $\{\lambda_{\beta,\mathbf{b}}\}$, solving (P2) is straightforward
using the Viterbi algorithm. The problem is finding an approximation of the optimal coefficients. The procedure
outlined in Section V for finding the coefficients involves solving a concave minimization problem of a dimension
that becomes intractable even for moderate values of $n$. To bypass this process, an alternative heuristic method
is proposed in the next section. The effectiveness of this approach is discussed in the next section through some
simulations.
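
The dynamic program described in this section can be sketched as follows. This is an illustrative variant and not the paper's exact formulation: it keeps only the last $k$ symbols as the state (which carries the same information for the recursion), it drops the cyclic convention so that the first $k$ symbols are charged distortion only, it works with the unnormalized sum of edge weights (47), and it assumes Hamming distortion and finite coefficients. The names `viterbi_p2` and `p2_cost` are hypothetical.

```python
from itertools import product
from math import inf

def viterbi_p2(x, k, lam, alpha, alphabet=(0, 1)):
    """Minimize sum_i [lam[(context_i, y_i)] + alpha * d(x_i, y_i)] over y by dynamic programming.

    lam maps (context tuple of length k, symbol) to a finite coefficient; requires len(x) >= k.
    """
    n = len(x)
    d = lambda a, b: float(a != b)          # assumed Hamming distortion
    # Initial states: every possible prefix y_1..y_k, charged for distortion only.
    cost = {s: alpha * sum(d(x[i], s[i]) for i in range(k))
            for s in product(alphabet, repeat=k)}
    back = []                               # back[t][state] = (previous state, emitted symbol)
    for i in range(k, n):
        new_cost, pointers = {}, {}
        for s, c in cost.items():
            for sym in alphabet:
                total = c + lam[(s, sym)] + alpha * d(x[i], sym)
                t = (s + (sym,))[1:]        # slide the length-k window
                if total < new_cost.get(t, inf):
                    new_cost[t], pointers[t] = total, (s, sym)
        cost = new_cost
        back.append(pointers)
    state = min(cost, key=cost.get)
    best = cost[state]
    y = [None] * n
    for i in range(n - 1, k - 1, -1):       # trace the survivor path backwards
        state, y[i] = back[i - k][state]
    y[:k] = state
    return y, best

def p2_cost(x, y, k, lam, alpha):
    """Direct evaluation of the same (unnormalized, non-cyclic) cost, for checking."""
    dist = sum(a != b for a, b in zip(x, y))
    lin = sum(lam[(tuple(y[i - k:i]), y[i])] for i in range(k, len(y)))
    return lin + alpha * dist
```

The number of states is $|\hat{\mathcal{X}}|^k$ and each stage examines $|\hat{\mathcal{X}}|$ transitions per state, so the running time is linear in $n$ and exponential in $k$, matching the complexity remark above.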

VII. Approximating the optimal coefficients 

As we discussed in Section IV, once the optimal coefficients are known, solving (P2), which can be done using
the Viterbi algorithm, is equivalent to solving (P1), which has exponential complexity in $n$. However, the problem
is finding such desired coefficients. In Section V, it was proposed that one method for finding a good approximation of these
coefficients is to solve (43) and find $\hat{\mathbf{m}}^*$. Then an approximation of the coefficients $\{\lambda_{\beta,\mathbf{b}}\}$ can be
made via (23) by evaluating the partial derivatives of $H(\mathbf{m})$ at $\hat{\mathbf{m}}^*$. But solving (43) requires solving a concave
minimization problem whose dimension makes it demanding even for moderate values of $n$. Therefore, in this section,
we consider a detour with moderate computational complexity.

First, assume that the desired distortion is small, or equivalently that $\alpha$ is large. In that case, the distance between
the original sequence $x^n$ and its quantized version $\hat{x}^n$ should be small. Therefore, their types, i.e., their $(k+1)$-th
order empirical distributions, are close. Hence, the coefficients computed based on $\mathbf{m}(x^n)$ provide a reasonable
approximation of the coefficients derived from $\mathbf{m}^*$. This implies that if our desired distortion is small, one possibility
is to compute the type of the input sequence, and evaluate the coefficients at $\mathbf{m}(x^n)$.

In the case where the desired distortion is not very small, we can use an iterative approach as follows. Start with
$\mathbf{m}(x^n)$. Compute the coefficients from (23) at $\mathbf{m}(x^n)$. Employ the Viterbi algorithm to solve (P2) at the computed
coefficients. Let $\hat{x}^n$ denote the output sequence. Compute $\mathbf{m}(\hat{x}^n)$, and recalculate the coefficients using (23) at
$\mathbf{m}(\hat{x}^n)$. Again, use the Viterbi algorithm to solve (P2) at the updated coefficients. Iterate.
For a conditional empirical distribution matrix $\mathbf{m}$, define its coefficient matrix as $\Lambda(\mathbf{m})$, where $\lambda_{\beta,\mathbf{b}}$ is defined
as in (23). For two matrices $A$ and $B$ of the same dimensions, define the scalar product of $A$ and $B$ as

$$A \odot B \triangleq \sum_{i,j} a_{i,j}\, b_{i,j}.$$

Now, succinctly, the iterative approach can be described as follows. For $t = 0$, let
$y^{n,(0)} = x^n$. For $t = 1, 2, \ldots$

$$\Lambda^{(t)} = \Lambda(\mathbf{m}(y^{n,(t-1)})),$$
$$y^{n,(t)} = \arg\min_{z^n}\, [\Lambda^{(t)} \odot \mathbf{m}(z^n) + \alpha d_n(x^n, z^n)].$$

Stop as soon as $y^{n,(t)} = y^{n,(t-1)}$.

For a given sequence $x^n$ and slope $\alpha$, assign to each sequence $y^n \in \hat{\mathcal{X}}^n$ the energy

$$\mathcal{E}(y^n) = H_k(y^n) + \alpha d_n(x^n, y^n). \tag{50}$$

As mentioned before, the goal is to find the sequence with minimum energy. Theorem 3 below gives some
justification on how the described approach serves this purpose. It shows that, through the iterations, the energy
level of the output is non-increasing at each step. Moreover, since the number of energy levels is finite, it proves that
the algorithm converges in a finite number of iterations.
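
The iteration and its energy can be traced on a toy example. In the sketch below the inner minimization of (P2) is done by brute force instead of the Viterbi algorithm, purely to keep the example short. Binary alphabets, Hamming distortion, base-2 logarithms, the $+\infty$ convention for coefficients of unseen patterns, the iteration cap `max_iters`, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import inf, log2

def counts(y, k):
    """Cyclic counts of (context, symbol) pairs and of contexts."""
    n = len(y)
    joint, ctx = Counter(), Counter()
    for i in range(n):
        b = tuple(y[(i - j) % n] for j in range(k, 0, -1))
        joint[(b, y[i])] += 1
        ctx[b] += 1
    return joint, ctx

def coefficients(y, k):
    """Coefficient matrix Lambda(m(y^n)) from (23), stored sparsely."""
    joint, ctx = counts(y, k)
    return {key: log2(ctx[key[0]] / c) for key, c in joint.items()}

def linear_cost(y, k, lam):
    """Scalar product of lam with m(y^n); pairs missing from lam cost +inf."""
    joint, _ = counts(y, k)
    return sum(c / len(y) * lam.get(key, inf) for key, c in joint.items())

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)

def energy(x, y, k, alpha):
    """Energy (50); uses the identity Lambda(m) . m = H(m)."""
    return linear_cost(y, k, coefficients(y, k)) + alpha * hamming(x, y)

def iterate(x, k, alpha, max_iters=50):
    """Run the iterative coefficient update, returning the final sequence and the energy trace."""
    space = list(product((0, 1), repeat=len(x)))
    y = tuple(x)
    energies = [energy(x, y, k, alpha)]
    for _ in range(max_iters):
        lam = coefficients(y, k)
        y_next = min(space, key=lambda z: linear_cost(z, k, lam) + alpha * hamming(x, z))
        if y_next == y:
            break
        y = y_next
        energies.append(energy(x, y, k, alpha))
    return y, energies
```

The returned energy trace is non-increasing, which is the behaviour established by Theorem 3 below. The fixed point reached need not be a global minimizer of (50).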

Theorem 3: For the described iterative algorithm, at each $t \geq 1$,

$$\mathcal{E}(y^{n,(t+1)}) \leq \mathcal{E}(y^{n,(t)}). \tag{51}$$

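Proof: For ease of notation, let $\tilde{x}^n = y^{n,(t)}$, $\tilde{\mathbf{m}} = \mathbf{m}(\tilde{x}^n)$, and $\tilde{\Lambda} = \Lambda(\tilde{\mathbf{m}})$. Similarly, let $\hat{x}^n = y^{n,(t+1)}$,
$\hat{\mathbf{m}} = \mathbf{m}(\hat{x}^n)$, and $\hat{\Lambda} = \Lambda(\hat{\mathbf{m}})$. From the concavity of $H(\mathbf{m})$ in $\mathbf{m}$,

$$H(\hat{\mathbf{m}}) \leq H(\tilde{\mathbf{m}}) + \tilde{\Lambda} \odot (\hat{\mathbf{m}} - \tilde{\mathbf{m}}), \tag{52}$$

where $A \odot B$, with $A$ and $B$ two matrices of the same dimensions, is equal to $\sum_{i,j} a_{i,j} b_{i,j}$. On the other hand,

$$\begin{aligned}
\tilde{\Lambda} \odot \tilde{\mathbf{m}} &= \sum_{\beta,\mathbf{b}} \tilde{\lambda}_{\beta,\mathbf{b}}\, \tilde{m}_{\beta,\mathbf{b}} \\
&= \sum_{\beta,\mathbf{b}} \tilde{m}_{\beta,\mathbf{b}} \log \frac{\sum_{\beta'} \tilde{m}_{\beta',\mathbf{b}}}{\tilde{m}_{\beta,\mathbf{b}}} && (53) \\
&= H(\tilde{\mathbf{m}}). && (54)
\end{aligned}$$

Therefore, combining (52) and (54) yields

$$H(\hat{\mathbf{m}}) \leq \tilde{\Lambda} \odot \hat{\mathbf{m}}. \tag{55}$$

Adding the constant term $\alpha d_n(x^n, \hat{x}^n)$ to both sides of (55), we get

$$\mathcal{E}(\hat{x}^n) = H(\hat{\mathbf{m}}) + \alpha d_n(x^n, \hat{x}^n) \leq \tilde{\Lambda} \odot \hat{\mathbf{m}} + \alpha d_n(x^n, \hat{x}^n). \tag{56}$$

But, since $\hat{x}^n$ is assumed to be a minimizer of (P2) for the computed coefficients $\tilde{\Lambda}$,

$$\begin{aligned}
\tilde{\Lambda} \odot \hat{\mathbf{m}} + \alpha d_n(x^n, \hat{x}^n) &\leq \tilde{\Lambda} \odot \tilde{\mathbf{m}} + \alpha d_n(x^n, \tilde{x}^n) \\
&= H(\tilde{\mathbf{m}}) + \alpha d_n(x^n, \tilde{x}^n) \\
&= \mathcal{E}(\tilde{x}^n). && (57)
\end{aligned}$$

Therefore, combining (56) and (57) yields the desired result, i.e.,

$$\mathcal{E}(\hat{x}^n) \leq \mathcal{E}(\tilde{x}^n). \tag{58}$$

■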

Remark 5: In the described iterative algorithm, for any slope $\alpha$, we assumed that the algorithm starts at $y^{n,(0)} = x^n$. However, as mentioned earlier, $m(x^n)$ provides a reasonable approximation of the desired type $m^*$ only for large values of $\alpha$. To address this issue, we can slightly modify the algorithm as follows. Instead of starting at $y^{n,(0)} = x^n$ for all values of $\alpha$, we gradually decrease the slope to the desired value, and use the final output of each step as the initial point for the next step. More explicitly, for any given $\alpha_0$, start from some large slope $\alpha_{\max}$ (corresponding to very low distortion). Run the previous iterative algorithm and find $\hat{x}^n(\alpha_{\max})$. Pick some integer $N_\alpha$, and define

$$\Delta\alpha \triangleq \frac{\alpha_{\max} - \alpha_0}{N_\alpha}.$$

Again run the iterative algorithm, but this time at $\alpha = \alpha_{\max} - \Delta\alpha$. Now, instead of starting from $y^{n,(0)} = x^n$, initialize $y^{n,(0)} = \hat{x}^n(\alpha_{\max})$. Repeat this process $N_\alpha$ times, i.e., at the $r$-th step, $r = 1, \ldots, N_\alpha$, run the algorithm at $\alpha = \alpha_{\max} - r\Delta\alpha$, and initialize $y^{n,(0)} = \hat{x}^n(\alpha_{\max} - (r-1)\Delta\alpha)$. At the final step $\alpha = \alpha_0$, and we have a reasonable quantized version of $x^n$ for initialization.
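The slope schedule of Remark 5 can be sketched as follows. This is only an illustration: `solve` is a hypothetical callable standing in for one run of the iterative algorithm at a fixed slope, taking the source sequence, an initial point, and the slope, and returning the converged output.

```python
def slope_schedule(alpha_max, alpha_0, num_steps):
    """Slopes alpha_max - r * delta for r = 0..num_steps, ending at alpha_0."""
    delta = (alpha_max - alpha_0) / num_steps
    return [alpha_max - r * delta for r in range(num_steps + 1)]


def anneal(x, solve, alpha_max, alpha_0, num_steps):
    """Run `solve(x, init, alpha)` along the schedule, warm-starting each run."""
    y = list(x)  # the first run, at alpha_max, starts from x itself
    for alpha in slope_schedule(alpha_max, alpha_0, num_steps):
        y = solve(x, y, alpha)  # output at this slope initializes the next one
    return y
```

With `alpha_max = 3`, `alpha_0 = 0.1` and `num_steps = 29`, the schedule steps down by 0.1 at a time, which matches the slope grid quoted in the figure captions.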

To gain further insight on (P2), for the coefficient matrix $\Lambda = \{\lambda_{\beta,b}\}_{\beta,b}$, define

$$\phi(\Lambda) = \min_{y^n} \left[\sum_{\beta,b} \lambda_{\beta,b}\, m_{\beta,b}(y^n) + \alpha d_n(x^n, y^n)\right] = \min_{y^n} \left[\Lambda \odot m(y^n) + \alpha d_n(x^n, y^n)\right]. \tag{59}$$

Since $\phi(\Lambda)$ is the minimum of multiple affine functions of $\Lambda$, it is a concave function. To each sequence $y^n \in \hat{\mathcal{X}}^n$, assign a coefficient matrix $\Lambda = [\lambda_{\beta,b}]$ as

$$\lambda_{\beta,b} = \left.\frac{\partial H(m)}{\partial m_{\beta,b}}\right|_{m(y^n)}. \tag{60}$$

Let $\mathcal{L}_d$ be the set of all such coefficient matrices. Similarly, to each possible conditional distribution matrix $m$ on $\hat{\mathcal{X}}^{k+1}$ which satisfies the stationarity condition defined in Section II-B, assign a coefficient matrix $\Lambda$ defined according to (60). Let $\mathcal{L}_c$ be the set of coefficient matrices calculated at $(k+1)$-th order stationary distributions on $\hat{\mathcal{X}}^{k+1}$. Note that while $\mathcal{L}_d$ is a discrete set (consisting of no more than $|\hat{\mathcal{X}}|^n$ elements), $\mathcal{L}_c$ is continuous.
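For a sequence $x^n$, let

$$\hat{x}^n = \arg\min_{y^n} \mathcal{E}(y^n),$$

and

$$\Lambda^* \triangleq \Lambda(m(\hat{x}^n)).$$

Note that $\Lambda^*$ is the optimal coefficient matrix required for replacing (P1) with (P2).

Lemma 2:

$$\Lambda^* = \arg\min_{\Lambda \in \mathcal{L}_d} \phi(\Lambda). \tag{61}$$

Proof: As shown before,

$$\phi(\Lambda^*) = \mathcal{E}(\hat{x}^n). \tag{62}$$

On the other hand, if $\tilde{x}^n$ is the minimizer of $\sum_{\beta,b} \lambda_{\beta,b}\, m_{\beta,b}(y^n) + \alpha d_n(x^n, y^n)$ for some $\Lambda \in \mathcal{L}_d$, then, with $\tilde{m} = m(\tilde{x}^n)$, as shown in the proof of Theorem 3,

$$H(\tilde{m}) \le \Lambda \odot \tilde{m}. \tag{63}$$

Therefore, adding $\alpha d(x^n, \tilde{x}^n)$ to both sides of (63) yields

$$\mathcal{E}(\tilde{x}^n) \le \phi(\Lambda). \tag{64}$$

But, by assumption,

$$\mathcal{E}(\hat{x}^n) \le \mathcal{E}(\tilde{x}^n). \tag{65}$$

Combining (62), (64), and (65) yields the desired result. ■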
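Remark 6: Note that

$$\min_{\Lambda \in \mathcal{L}_c} \min_{y^n} \left[\sum_{\beta,b} \lambda_{\beta,b}\, m_{\beta,b}(y^n) + \alpha d_n(x^n, y^n)\right] = \min_{y^n} \left[\min_{\Lambda \in \mathcal{L}_c} \sum_{\beta,b} \lambda_{\beta,b}\, m_{\beta,b}(y^n) + \alpha d_n(x^n, y^n)\right]. \tag{66}$$

But $H(m(y^n)) \le \sum_{\beta,b} \lambda_{\beta,b}\, m_{\beta,b}(y^n)$ for any $\Lambda \in \mathcal{L}_c$, and the lower bound is achieved at $\Lambda(m(y^n))$. Therefore,

$$\min_{\Lambda \in \mathcal{L}_c} \min_{y^n} \left[\sum_{\beta,b} \lambda_{\beta,b}\, m_{\beta,b}(y^n) + \alpha d_n(x^n, y^n)\right] = \min_{y^n} \left(H_k(y^n) + \alpha d(x^n, y^n)\right). \tag{67}$$

Hence, we can replace $\mathcal{L}_d$ by $\mathcal{L}_c$ in (61) and still get the same result. This transform converts the discrete optimization stated in (61), which can be solved by exhaustive search, into the optimization of a continuous function of relatively low dimension.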
Fig. 1. Average performance of the iterative Viterbi-based lossy coder applied to an i.i.d. Bern(0.5) source, plotted against the distortion $D$ ($n = 10^4$, $k = 8$, $\alpha = (3, 2.9, \ldots, 0.1)$, and $L = 50$).

VIII. Simulation results 

As the first example, consider an i.i.d. Bern($p$) source with $p = 0.5$. Fig. 1 shows the performance of the iterative algorithm described in Section VII, slightly modified as suggested in Remark 5. The simulation parameters are as follows: $n = 10^4$, $k = 8$, and $\alpha = (3, 2.9, \ldots, 0.1)$. Each point corresponds to the average performance over $L = 50$ independent source realizations. As described in that section, the iterative algorithm continues until there is no further decrease in the cost. Fig. 2 shows the average, minimum and maximum number of iterations required before convergence versus $\alpha$. Again, the number of trials is $L = 50$. It can be observed that the number of iterations in this case is always below 60, which, given the size of the search space, i.e., $2^n$, shows fast convergence.

The next example involves a binary symmetric Markov source (BSMS) with transition probability $q = 0.2$. Fig. 3 compares the average performance of the Viterbi encoder against upper and lower bounds on $R(D)$ [35]. The performance of the algorithm is compared only against bounds on $R(D)$ in this case because the rate-distortion function of a Markov source is not known, except in a low-distortion region. For low distortions, the Shannon lower bound is tight [36]. More explicitly, for $D \le D_c \approx 0.0159$,

$$R(D) = H_b(q) - H_b(D),$$
Fig. 2. From top to bottom: average, minimum and maximum number of iterations before convergence (i.i.d. Bern(0.5) source, $n = 10^4$, $k = 8$, $\alpha = (3, 2.9, \ldots, 0.1)$, and $L = 50$).



where $H_b(\epsilon)$ denotes the entropy of the binary distribution $(\epsilon, 1 - \epsilon)$. For $D > D_c$, $R(D) > H_b(q) - H_b(D)$.
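For reference, the Shannon lower bound used above is easy to evaluate numerically. The sketch below assumes rates measured in bits (base-2 logarithms), which the text does not state explicitly; the function names are illustrative only.

```python
import math


def binary_entropy(p):
    """Entropy in bits of the distribution (p, 1 - p), with 0 * log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def shannon_lower_bound(q, distortion):
    """H_b(q) - H_b(D) for a binary symmetric Markov source with transition probability q."""
    return binary_entropy(q) - binary_entropy(distortion)
```

For example, `shannon_lower_bound(0.2, 0.0159)` evaluates the bound at the edge of the region where, according to the text, it coincides with $R(D)$.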

A comparison with the memoryless case (Fig. 1) seems to suggest that the problem lies less in how quickly (in $n$) we converge to the performance of the exhaustive search scheme of (19) than in how quickly the convergence in (44) takes place, which is source dependent and not in our control.

Fig. 4 shows the average number of iterations before convergence versus $\alpha$. It can be observed that the average is always below 15. To illustrate how the energy decreases, Fig. 5 and Fig. 6 show the energy decay through the iterations for $\alpha = 1.6$ and $\alpha = 1$, respectively.

Remark 7: Similar to [31], in the figures we use $H_k(\hat{x}^n)$ as the rate, while in fact it is not a true length function. The reason is that, as explained in [31], by Ziv's inequality [37], if $k = o(\log n)$, then for any $\epsilon > 0$ there exists $N_\epsilon \in \mathbb{N}$ such that for any $n > N_\epsilon$ and any sequence $y = (y_1, y_2, \ldots)$,

$$\left[\frac{1}{n}\ell_{\mathrm{LZ}}(y^n) - H_k(y^n)\right] \le \epsilon. \tag{68}$$

IX. Conclusions 

In this paper, a new approach to fixed-slope lossy compression of discrete sources is proposed. The core ingredient is the Viterbi algorithm, a dynamic programming algorithm that enables the encoder to find the reconstruction sequence with minimum cost. The encoder first assigns weights to the different contexts of length $k$, i.e., subsequences of length $k + 1$, that appear within the reconstruction sequence. The overall cost assigned to each possible reconstruction sequence is then the sum of the weights of the different contexts multiplied by their number of appearances in the sequence, plus a constant times the distance between the original sequence and the candidate reconstruction sequence. From this definition, it follows that the state of the Viterbi algorithm at time $t$ is the last $k$ symbols observed plus the current symbol in the sequence, i.e., $(y_{t-k}, \ldots, y_t)$. Therefore, the trellis has $|\hat{\mathcal{X}}|^{k+1}$ states overall, corresponding to the $|\hat{\mathcal{X}}|^{k+1}$ different possible contexts of length $k$.

Fig. 3. Average performance of the iterative Viterbi-based lossy coder applied to a BSMS with $q = 0.2$ ($n = 25 \times 10^3$, $k = 8$, $\alpha = 3 : -0.1 : 0.1$, and $L = 50$).

Hence, for coding a sequence of length $n$, the computational complexity of the Viterbi algorithm is of order $O(n 2^{k+1})$. We prove that there exists a set of optimal coefficients for which the described algorithm achieves the rate-distortion performance for any stationary ergodic process. The problem is finding those weights. We provide an optimization problem whose solution can be used to find an asymptotically tight approximation of the optimal coefficients, resulting in an overall scheme which is universal with respect to the class of stationary ergodic sources. However, solving this optimization problem is computationally demanding, and in fact infeasible in practice for even moderate blocklengths. To overcome this problem, we propose an iterative approach for approximating the optimal coefficients. This approach is partially justified by a guarantee of convergence to at least a local minimum.

In the described iterative approach, the algorithm starts at a large slope (corresponding to a small distortion) and gradually decreases the slope until it hits the desired value. At each slope, the algorithm runs the Viterbi algorithm iteratively until it converges. An interesting possible next step is to explore whether there exists a sequence of slopes converging to the desired value in a small number of steps (e.g., $o(n)$) for which we can guarantee convergence of the algorithm to the global minimum at the end of the process. Existence of such a sequence of slopes would imply a universal lossy compression algorithm with moderate computational complexity.

Fig. 4. Average number of iterations before convergence (BSMS with $q = 0.2$, $n = 25 \times 10^3$, $k = 8$, $\alpha = 3 : -0.1 : 0.1$, and $L = 50$).
APPENDIX A: Proof of Theorem 2

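Proof: By rearranging the terms, the cost that is to be minimized in (P1) can alternatively be represented as follows:

$$
\begin{aligned}
H_k(y^n) + \alpha d_n(x^n, y^n) &= H_k(m(y^n)) + \alpha \frac{1}{n} \sum_{i=1}^{n} d(x_i, y_i) \\
&= H_k(m(y^n)) + \alpha \frac{1}{n} \sum_{i=1}^{n} d(x_i, y_i) \sum_{a \in \mathcal{X}, b \in \hat{\mathcal{X}}} \mathbb{1}_{(x_i, y_i) = (a, b)} \\
&= H_k(m(y^n)) + \alpha \frac{1}{n} \sum_{a \in \mathcal{X}, b \in \hat{\mathcal{X}}} \sum_{i=1}^{n} d(a, b)\, \mathbb{1}_{(x_i, y_i) = (a, b)} \\
&= H_k(m(y^n)) + \alpha \sum_{a \in \mathcal{X}, b \in \hat{\mathcal{X}}} d(a, b)\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{(x_i, y_i) = (a, b)} \\
&= H_k(m(y^n)) + \alpha \sum_{a \in \mathcal{X}, b \in \hat{\mathcal{X}}} d(a, b)\, \hat{p}^{(1)}_{[x^n, y^n]}(a, b) \\
&= H_{\hat{p}^{(k+1)}_{[y^n]}}(Y_{k+1} \mid Y^k) + \alpha\, \mathrm{E}_{\hat{p}^{(1)}_{[x^n, y^n]}} d(X_1, Y_1).
\end{aligned}
\tag{A-1}
$$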



November 23, 2010 



Fig. 6. Energy decay through the iterations for $\alpha = 1$ (BSMS with $q = 0.2$, $n = 25 \times 10^3$, and $k = 8$).



This new representation reveals the close connection between (P1) and (40). Although the costs we are trying to minimize in the two problems are equal, there is a fundamental difference between them: (P1) is a discrete optimization problem, while the optimization space in (40) is continuous.

Let $\mathcal{E}^*$ and $\mathcal{P}^*$ be the set of minimizers of (P1) and the set of joint empirical distributions of order $\ell$, $\hat{p}^{(\ell)}_{[x^n, y^n]}$, induced by them, respectively. Also let $\mathcal{S}^*$ be the set of marginalized distributions of order $k + 1$ in $\mathcal{P}^*$ with respect to $Y$. Finally, let $\hat{C}_n^*$ and $C_n^*$ be the minimum values achieved by (P1) and (43), respectively.
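The distortion part of (A-1) says that the per-letter average distortion equals the expected distortion under the first-order joint empirical distribution of $(x^n, y^n)$. A small numerical check of that identity is sketched below; the function names are illustrative and the Hamming distortion in the usage note is only an example choice.

```python
from collections import Counter


def average_distortion(x, y, d):
    """(1/n) * sum_i d(x_i, y_i)."""
    return sum(d(a, b) for a, b in zip(x, y)) / len(x)


def first_order_joint(x, y):
    """Empirical joint distribution of the pairs (x_i, y_i)."""
    n = len(x)
    return {pair: c / n for pair, c in Counter(zip(x, y)).items()}


def expected_distortion(p, d):
    """sum_{a,b} d(a, b) * p(a, b)."""
    return sum(prob * d(a, b) for (a, b), prob in p.items())
```

With `d = lambda a, b: float(a != b)`, both `average_distortion(x, y, d)` and `expected_distortion(first_order_joint(x, y), d)` return the fraction of positions where the two sequences differ.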

In order to make the proof more tractable, we break it down into several steps as follows. 

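1) Let $y^n \in \mathcal{E}^*$, and let $\hat{p}^{(\ell)}_{[x^n, y^n]}$ be the induced joint empirical distribution. It is easy to check that $\hat{p}^{(\ell)}_{[x^n, y^n]}$ satisfies all the constraints mentioned in (43). The only condition that might need some thought is the stationarity constraint, which also holds because

$$
\sum_{a_\ell \in \mathcal{X},\, b_\ell \in \hat{\mathcal{X}}} \hat{p}^{(\ell)}_{[x^n, y^n]}(a^\ell, b^\ell) = \frac{1}{n} \left|\left\{1 \le i \le n : (x_{i-\ell+1}^{i-1}, y_{i-\ell+1}^{i-1}) = (a^{\ell-1}, b^{\ell-1})\right\}\right| = \sum_{a_\ell \in \mathcal{X},\, b_\ell \in \hat{\mathcal{X}}} \hat{p}^{(\ell)}_{[x^n, y^n]}(a_\ell a^{\ell-1}, b_\ell b^{\ell-1}). \tag{A-2}
$$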
Therefore, since $C_n^*$ is the minimum of (43), we have

$$
\begin{aligned}
C_n^* &\le H_k(m(y^n)) + \alpha\, \mathrm{E}_{\hat{p}^{(1)}_{[x^n, y^n]}} d(X_1, Y_1) \\
&= H_k(m(y^n)) + \alpha d_n(x^n, y^n) \\
&= \hat{C}_n^*.
\end{aligned}
\tag{A-3}
$$

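2) Let $p^{*(\ell)} \in \mathcal{P}^*$. Based on this joint probability distribution and $x^n$, we construct a reconstruction sequence $\tilde{X}^n$ as follows: divide $x^n$ into $r = \lceil n/\ell \rceil$ consecutive blocks:

$$x^{\ell},\; x_{\ell+1}^{2\ell},\; \ldots,\; x_{(r-2)\ell+1}^{(r-1)\ell},\; x_{(r-1)\ell+1}^{n},$$

where, except for possibly the last block, the blocks have length $\ell$. The new sequence is constructed as follows:

$$\tilde{X}^{\ell},\; \tilde{X}_{\ell+1}^{2\ell},\; \ldots,\; \tilde{X}_{(r-2)\ell+1}^{(r-1)\ell},\; \tilde{X}_{(r-1)\ell+1}^{n},$$

where for $i = 1, \ldots, r - 1$, $\tilde{X}_{(i-1)\ell+1}^{i\ell}$ is a sample from the conditional distribution $p^{*(\ell)}(\hat{X}^\ell \mid X^\ell = x_{(i-1)\ell+1}^{i\ell})$, and $\tilde{X}_{(r-1)\ell+1}^{n} \sim p^{*(\ell)}(\hat{X}^{n-(r-1)\ell} \mid X^{n-(r-1)\ell} = x_{(r-1)\ell+1}^{n})$.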

3) Assume that $x = \{x_i\}_{i=1}^{\infty}$ is a given individual sequence. For each $n$, let $p^{*(k+1)}$ be the $(k+1)$-th order marginalized version of the solution of (43) on $\hat{\mathcal{X}}^{k+1}$. Moreover, let $\tilde{X}^n$ be constructed as described in the previous item, and let $\hat{p}^{(k+1)}_{[\tilde{X}^n]}$ be the $(k+1)$-th order empirical distribution induced by $\tilde{X}^n$. We now prove that

$$\left\| p^{*(k+1)} - \hat{p}^{(k+1)}_{[\tilde{X}^n]} \right\|_1 \to 0, \quad \text{a.s.}, \tag{A-4}$$

where the randomization in (A-4) is only in the generation of $\tilde{X}^n$.



Remark 8: Since $p^{*(\ell)}$ satisfies the stationarity condition, its $(k+1)$-th order marginalized distribution, $p^{*(k+1)}$, is well-defined and can be computed with respect to any $(k+1)$ consecutive positions in $1, \ldots, \ell$. In other words, for $a^{k+1} \in \hat{\mathcal{X}}^{k+1}$,

$$p^{*(k+1)}(a^{k+1}) = p^{*(\ell)}\big(\hat{X}_{j+1}^{j+k+1} = a^{k+1}\big), \tag{A-5}$$

for any $j \in \{0, \ldots, \ell - k - 1\}$, and the result does not depend on the choice of $j$.

In order to show that the difference between $\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1})$ and $p^{*(k+1)}(a^{k+1})$ goes to zero almost surely, we decompose $\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1})$ into the average of $\ell - k$ terms, each of which converges to $p^{*(k+1)}(a^{k+1})$. Then, using the union bound, we get the desired result, which is the convergence of $\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1})$ to $p^{*(k+1)}(a^{k+1})$.
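For $a^{k+1} \in \hat{\mathcal{X}}^{k+1}$,

$$
\begin{aligned}
\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1}) - p^{*(k+1)}(a^{k+1})
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\tilde{X}_{i-k}^{i} = a^{k+1}} - p^{*(k+1)}(a^{k+1}) \\
&= \frac{1}{n} \sum_{j=0}^{\ell-k-1} \left[\sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}}\right] + \delta_1 - p^{*(k+1)}(a^{k+1}) \\
&= \frac{1}{\ell-k} \sum_{j=0}^{\ell-k-1} \left[\frac{1}{r-1} \sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}}\right] + \delta_2 - p^{*(k+1)}(a^{k+1}),
\end{aligned}
\tag{A-6}
$$

where $\delta_1$ accounts for the edge effects between the blocks, and $\delta_2$ is defined such that $\delta_2 - \delta_1$ takes care of the effect of replacing the normalization $\frac{1}{n}$ with $\frac{1}{(r-1)(\ell-k)}$. Both $\delta_1$ and $|\delta_2 - \delta_1|$ are bounded by quantities that vanish as $n$ grows. Hence, $\delta_1 \to 0$ and $\delta_2 \to 0$ as $n \to \infty$.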

The new representation decomposes a sequence of correlated random variables, $\{\mathbb{1}_{\tilde{X}_{i-k}^{i} = a^{k+1}}\}_{i=k+1}^{n}$, into $\ell - k$ sub-sequences, each of which is an independent process. To achieve this, some counts that lie between two blocks are ignored, i.e., if $\mathbb{1}_{\tilde{X}_{i-k}^{i} = a^{k+1}}$ depends on more than one block of the form $\tilde{X}_{(i-1)\ell+1}^{i\ell}$, we ignore it. The effect of such ignored counts is no more than $\delta_1$, which goes to zero as $k, \ell \to \infty$ because the theorem requires $k = o(\ell)$. More specifically, in (A-6), for each $j \in \{0, \ldots, \ell - k - 1\}$, $\{\mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}}\}_{i=1}^{r-1}$ is a sequence of independent, not necessarily identically distributed, random variables.



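For $n$ large enough, $|\delta_2| < \epsilon/2$. Therefore, by Hoeffding's inequality [38] and the union bound,

$$
\begin{aligned}
&\mathrm{P}\left(\left|\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1}) - p^{*(k+1)}(a^{k+1})\right| > \epsilon\right) \\
&\quad = \mathrm{P}\left(\left|\frac{1}{\ell-k} \sum_{j=0}^{\ell-k-1} \left[\frac{1}{r-1} \sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}} - p^{*(k+1)}(a^{k+1})\right] + \delta_2\right| > \epsilon\right) \\
&\quad \le \mathrm{P}\left(\left|\frac{1}{\ell-k} \sum_{j=0}^{\ell-k-1} \left[\frac{1}{r-1} \sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}} - p^{*(k+1)}(a^{k+1})\right]\right| > \frac{\epsilon}{2}\right) \\
&\quad \le \sum_{j=0}^{\ell-k-1} \mathrm{P}\left(\left|\frac{1}{r-1} \sum_{i=1}^{r-1} \mathbb{1}_{\tilde{X}_{i\ell-j-k}^{i\ell-j} = a^{k+1}} - p^{*(k+1)}(a^{k+1})\right| > \frac{\epsilon}{2}\right) \\
&\quad \le 2(\ell - k)\, e^{-r\epsilon^2/2}.
\end{aligned}
\tag{A-7}
$$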
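Again by the union bound,

$$
\begin{aligned}
\mathrm{P}\left(\left\|\hat{p}^{(k+1)}_{[\tilde{X}^n]} - p^{*(k+1)}\right\|_1 > \epsilon\right)
&\le \sum_{a^{k+1} \in \hat{\mathcal{X}}^{k+1}} \mathrm{P}\left(\left|\hat{p}^{(k+1)}_{[\tilde{X}^n]}(a^{k+1}) - p^{*(k+1)}(a^{k+1})\right| > \frac{\epsilon}{|\hat{\mathcal{X}}|^{k+1}}\right) \\
&\le |\hat{\mathcal{X}}|^{k+1}\, 2(\ell - k)\, e^{-r\epsilon^2 / \left(2|\hat{\mathcal{X}}|^{2(k+1)}\right)}.
\end{aligned}
\tag{A-8}
$$

Our choices of $k = k_n = o(\log n)$, $\ell = \ell_n = o(n^{1/4})$, $k = o(\ell)$, and $k_n, \ell_n \to \infty$ as $n \to \infty$ now guarantee that the right hand side of (A-8) is summable on $n$, which together with the Borel-Cantelli Lemma yields the desired result of (A-4).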
4) Using similar steps as above, we can prove that

$$\left\| q^* - \hat{p}^{(1)}_{[x^n, \tilde{X}^n]} \right\|_1 \to 0, \quad \text{a.s.} \tag{A-9}$$

Again, we first prove that $|q^*(a, b) - \hat{p}^{(1)}_{[x^n, \tilde{X}^n]}(a, b)| \to 0$ for each $a \in \mathcal{X}$ and $b \in \hat{\mathcal{X}}$. To do this, we again need to decompose

$$\{\mathbb{1}_{x_i = a,\, \tilde{X}_i = b}\}_{i=1}^{n}$$

into $\ell$ sub-sequences, each of which is a sequence of independent random variables, and then apply Hoeffding's inequality plus the union bound. Finally, we apply the union bound again, in addition to the Borel-Cantelli Lemma, to get our desired result.

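5) Combining the results of the last two parts, and the fact that $H_k(m)$ and $\mathrm{E}_q d(X, Y)$ are bounded continuous functions of $m$ and $q$ respectively, we conclude that

$$
\begin{aligned}
H_k(\tilde{X}^n) + \alpha d_n(x^n, \tilde{X}^n) &= H_{\hat{p}^{(k+1)}_{[\tilde{X}^n]}}(Y_{k+1} \mid Y^k) + \alpha\, \mathrm{E}_{\hat{p}^{(1)}_{[x^n, \tilde{X}^n]}} d(X_1, Y_1) \\
&= H_{p^{*(k+1)}}(Y_{k+1} \mid Y^k) + \alpha\, \mathrm{E}_{q^*} d(X_1, Y_1) + \epsilon_n \\
&= C_n^* + \epsilon_n,
\end{aligned}
$$

where $\epsilon_n \to 0$ with probability 1.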

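6) Since $\hat{C}_n^*$ is the minimum of (P1), we have

$$\hat{C}_n^* \le H_k(\tilde{X}^n) + \alpha d_n(x^n, \tilde{X}^n) = C_n^* + \epsilon_n.$$

On the other hand, as shown in (A-3), $C_n^* \le \hat{C}_n^*$. Therefore,

$$\left|\hat{C}_n^* - C_n^*\right| \to 0$$

as $n \to \infty$.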

7) For a given set of coefficients $\Lambda = \{\lambda_{\beta,b}\}_{\beta,b}$ computed at some $m$ according to (23), define



(A-10) 



(A-ll) 



(A- 12) 



$$f(\Lambda) = \min_{y^n} \left[\sum_{\beta,b} \lambda_{\beta,b}\, m_{\beta,b}(y^n) + \alpha d_n(x^n, y^n)\right]. \tag{A-13}$$
It is easy to check that $f$ is continuous, and bounded by $1 + \alpha$. Therefore, since $\Lambda$ is in turn a continuous function of $m$, and, as proved in (A-4),

$$\left\| p^{*(k+1)} - \hat{p}^{(k+1)}_{[\tilde{X}^n]} \right\|_1 \to 0,$$

we conclude that

$$\left| f(\Lambda^*) - f(\hat{\Lambda}) \right| \to 0, \tag{A-14}$$

where $\Lambda^*$ and $\hat{\Lambda}$ are the coefficients computed at $p^{*(k+1)}$ and $\hat{p}^{(k+1)}_{[\tilde{X}^n]}$, respectively.
8) Let $\hat{X}^n$ be the output of (P2) when the coefficients are computed at $m(\tilde{X}^n)$. Then, from Theorem 3,

$$H_k(\hat{X}^n) + \alpha d_n(x^n, \hat{X}^n) \le H_k(\tilde{X}^n) + \alpha d_n(x^n, \tilde{X}^n) = C_n^* + \epsilon_n. \tag{A-15}$$

Since $\epsilon_n \to 0$, this shows that, having computed the coefficients at $m(\tilde{X}^n)$, we would get a universal lossy compressor. But instead, we want to compute the coefficients at $m^*$. From (A-14), the difference between the performances of these two algorithms goes to zero. Therefore, we finally get our desired result, which is

'H k (X n )+ad n {X n ,X n ) 
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